The aim of this project is to be able to predict a player’s win percentage based on their individual performance.
library(tidymodels)
library(tidyverse)
library(ISLR)
library(ISLR2)
library(discrim)
library(poissonreg)
library(corrr)
library(klaR)
library(ggplot2)
library(ggthemes)
library(pROC)
library(janitor)
library(corrplot)
library(rpart.plot)
library(vip)
library(randomForest)
library(xgboost)
library(yardstick)
tidymodels_prefer()
# These are all the libraries I used to make this project happen!
Valorant is a free-to-play, first-person, tactical shooter published by Riot Games. Although officially released in June 2020, the development of the game started in 2014 and did not start their closed beta access until April 2020. Valorant is set in the near future, where players play as one of 20 “agents” – characters based on several countries and cultures around the world. In the main game mode, players are assigned to a team of 5, either attacking or defending, with each team aiming to be the first to win 13 rounds. Not only is there a halftime to switch sides, but there is an overtime if the teams end up in a tie. Agents have unique abilities, each requiring charges, as well as a unique ultimate ability that requires charging through kills, deaths, orbs, or objectives.
As this game is still relatively young, it can easily be seen that this game is also rapidly changing. Due to factors such as new maps, new agents, or updates that can buff or nerf certain aspects of this game, the meta of Valorant is ever-changing. Moreover, as Valorant is establishing its name as the next up and coming e-sport, professional players are dissecting every bit of the game to get the edge over their opponents. Through this project, we can do a deep dive of which factors lead to a higher win percentage and what trends are currently helping players dominate the game.
With all of that settled, let’s cover exactly how we’re going to accomplish this. First comes some data clean up and manipulation. Secondly, we’ll get into some exploratory data analysis – seeing the distribution of our outcome variable “win_percent” and the relationships among the many predictor variables. Next, we’ll follow up with our models, specifically linear regression, decision tree, random forest, and boosted tree. After fitting these models to the training set, we will fit the best performing model to the testing data.
Now that we all have a basic understanding of the game, let’s get started!
Before we can even modify the data set, we have to read it in! Although the data was already pretty organized and neat, let’s use the clean_names function to ensure it.
valData <- read.csv('Data/val_stats.csv')
valorant <- valData %>% clean_names() # lower cases all variables, replaces spaces with underscores, etc.
Let’s remove the predictor variables “region”, “name”, and “tag” as they are just characteristics to distinguish the players and not so relevant with predicting win percentage.
valorant <- valorant %>%
select(-region, -name, -tag)
We’ll remove the extremities, as they were invalid entries due to either incorrect input or containing a missing value in general.
valorant <- valorant %>%
filter(win_percent != 0,
win_percent != 100,
rating == 'Unrated' | rating == 'Bronze 2' | rating == 'Bronze 3' |
rating == 'Silver 1' | rating == 'Silver 2' | rating == 'Silver 3' |
rating == 'Gold 1' | rating == 'Gold 2' | rating == 'Gold 3' |
rating == 'Platinum 1' | rating == 'Platinum 2' | rating == 'Platinum 3' |
rating == 'Diamond 1' | rating == 'Diamond 2' | rating == 'Diamond 3' |
rating == 'Immortal 1' | rating == 'Immortal 2' | rating == 'Immortal 3' |
rating == 'Radiant')
Here we turn all the nominal variables into factors so that they can be applied to our models.
valorant$agent_1 <- factor(valorant$agent_1)
valorant$agent_2 <- factor(valorant$agent_2)
valorant$agent_3 <- factor(valorant$agent_3)
valorant$gun1_name <- factor(valorant$gun1_name)
valorant$gun2_name <- factor(valorant$gun2_name)
valorant$gun3_name <- factor(valorant$gun3_name)
Same thing for the “rating” variable, and let’s organize the ranks to match how it’s organized in Valorant.
valorant$rating <- factor(valorant$rating,
levels = c('Unrated', 'Bronze 2', 'Bronze 3',
'Silver 1', 'Silver 2', 'Silver 3',
'Gold 1', 'Gold 2', 'Gold 3',
'Platinum 1', 'Platinum 2', 'Platinum 3',
'Diamond 1', 'Diamond 2', 'Diamond 3',
'Immortal 1', 'Immortal 2', 'Immortal 3',
'Radiant'))
These variables were read in as character variables due to the comma in the number, so let’s set them to their appropriate class.
valorant$gun1_kills <- as.numeric(gsub(',', '', valorant$gun1_kills))
valorant$gun2_kills <- as.numeric(gsub(',', '', valorant$gun2_kills))
valorant$first_bloods <- as.numeric(gsub(',', '', valorant$first_bloods))
valorant$kills <- as.numeric(gsub(',', '', valorant$kills))
valorant$deaths <- as.numeric(gsub(',', '', valorant$deaths))
valorant$assists <- as.numeric(gsub(',', '', valorant$assists))
valorant$headshots <- as.numeric(gsub(',', '', valorant$headshots))
That does it for the data clean up! Let’s set the size to lessen the run-time and set the seed to get consistent results!
set.seed(1)
valorant <- sample_n(valorant, size = 10000)
To start of the EDA, we’ll take a quick look at the data to see if we missed anything in the data clean up.
head(valorant)
## rating damage_round headshots headshot_percent aces clutches flawless
## 1 Immortal 1 124.1 51 14.3 0 18 2
## 2 Immortal 3 151.1 1309 29.5 3 183 84
## 3 Immortal 1 129.8 92 22.5 0 17 10
## 4 Immortal 1 120.5 471 20.6 0 95 42
## 5 Immortal 1 142.1 865 20.2 2 150 87
## 6 Immortal 1 123.8 643 41.5 0 89 39
## first_bloods kills deaths assists kd_ratio kills_round most_kills score_round
## 1 20 109 140 51 0.78 0.6 17 187.6
## 2 255 1712 1690 421 1.01 0.8 29 228.3
## 3 19 151 176 51 0.86 0.7 20 193.7
## 4 65 747 875 406 0.85 0.6 26 181.2
## 5 193 1396 1324 635 1.05 0.7 29 217.6
## 6 84 730 853 266 0.86 0.6 29 185.3
## wins win_percent agent_1 agent_2 agent_3 gun1_name gun1_head gun1_body
## 1 4 44.4 Jett Chamber Killjoy Vandal 25 71
## 2 51 51.0 Reyna Raze Skye Vandal 41 55
## 3 4 40.0 Sova Brimstone Viper Vandal 29 65
## 4 33 58.9 Viper Astra Sova Vandal 30 66
## 5 55 61.1 Reyna Sage Brimstone Phantom 31 64
## 6 25 46.3 Astra Reyna Sova Vandal 66 32
## gun1_legs gun1_kills gun2_name gun2_head gun2_body gun2_legs gun2_kills
## 1 4 45 Phantom 14 83 3 14
## 2 3 1099 Phantom 40 57 3 170
## 3 6 95 Phantom 23 74 3 12
## 4 5 341 Phantom 32 62 6 171
## 5 6 506 Vandal 28 69 4 295
## 6 1 366 Phantom 48 50 2 151
## gun3_name gun3_head gun3_body gun3_legs gun3_kills
## 1 Ghost 23 61 16 11
## 2 Ghost 41 57 2 142
## 3 Spectre 19 69 11 11
## 4 Spectre 20 76 5 89
## 5 Spectre 21 72 7 172
## 6 Spectre 42 53 5 58
Everything looks good here!
dim(valorant)
## [1] 10000 35
Being that we started with 38 total variables and then removed “region”, “name”, and “tag”, we correctly have 35 variables left. We can see that we have the correct amount of observations specified in our sample_n function! Let’s see how well the numerical variables correlate with each other…
valorant %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = 'full', diag = FALSE, method = 'square', order = 'AOE',
tl.col = 'orange', col = COL2('PuOr'))
As we can see, majority of the variables actually go hand-in-hand with
one another, with the exception of some negative correlations like
gun1_head and gun1_body for example – but that’s to be expected. To no
surprise, kills are highly correlated with variables like “clutches”,
“flawless”, and “deaths”; but it came as a shock to me to see that there
was actually a negative correlation with variables such as “gun2_head”
and “win_percent”! Speaking of “win_percent”, let’s inspect the
distribution for it – being that it’s the variable we’re predicting.
valorant %>%
ggplot(aes(win_percent)) +
geom_histogram(bins = 50)
It can be observed that our response variable, “win_percent”, has a
normal distribution – with majority of the players’ win percentages
being around 50%. We have a small amount of players winning a very high
amount of their games this act, sitting at about the 90% area.
Alternatively, we have some players that are not doing so hot, sitting
under the 25% area. Maybe it has something to do with the agents they
are selecting? Let’s take a look.
valorant %>%
ggplot(aes(y = agent_1)) +
geom_bar()
As of this act, we can see that the leading and most trending agents are
Chamber, Reyna, and Jett. Having personal experience playing this game,
I can tell you that this is a very accurate depiction of the current
state of the game. These agents have lots of unique abilities and traits
that help their team win, especially if played effectively and
correctly. But agents don’t define the game entirely… how about the
weapons that the players are choosing?
valorant %>%
ggplot(aes(y = gun1_name)) +
geom_bar()
With no surprise, the two most popular guns are the Vandal and Phantom.
It is still very up in the air about which gun is objectively better,
but I would assume the Vandal gets a lot more usage due to the fact that
it can hit a one-shot headshot and a phantom cannot. The up-side to the
phantom however, is that it allows the user to be able to control the
spray of bullets, where as the Vandal is more effective with burst
shots.
valorant %>%
ggplot(aes(y = gun2_name)) +
geom_bar()
For the peoples’ second choice, we see that the Vandal and Phantom are
used as a back-up pick (maybe their luck isn’t working with their first
choice), but we also see that the Spectre, Operator, and Ghost are
popular options. The reason that these guns may be the second choice for
a lot of people can simply be due to economical reasons. The Operator,
being the most expensive weapon in the game, does not allow the players
to be able to buy it very often. With the Spectre, it is a very popular
choice in terms of being a cheap but effective weapon – so if a team has
to save up a little more money to be able to buy a Vandal or Phantom the
next round, they will opt for the Spectre. As for the Ghost, it is a
common weapon to hold during the pistol round – the first round of every
game and the first round after every half.
valorant %>%
ggplot(aes(y = gun3_name)) +
geom_bar()
Lastly, we look at the peoples’ third choice. We can infer that the
large variation of choices can be the result of a great amount of
factors, all the way from economical reasons (similar to the second
choice) to subjective preference, where people just find it fun to use
(after all it is a video game). Commonly picked as the third weapon, we
see the Spectre (probably having similar reason as when it was the
second choice), Phantom and Operator, and the three most popular
pistols: Sheriff, Classic, and Ghost. With such a large range of options
for a third choice, let’s look at how the players perform after choosing
such weapons…
valorant %>%
ggplot(aes(x = score_round, y = gun3_name)) +
geom_boxplot()
Notably in this graph, we can see that the average score per round
across all the levels of guns sits around the 220 area. Unexpectedly to
me, although not picked quite nearly as often, the average score per
round for people who chose either the Bulldog or Ares have a higher
average than those who chose the Spectre. This could simply be the
reason because there are not a lot of observations for these guns and
since the Spectre has a higher pick rate, the average is brought down.
So we’ve seen how dangerous the players can be, but just how precise is
their shot?
valorant %>%
ggplot(aes(x = headshot_percent, y = rating)) +
geom_boxplot()
From this graph, we can see that the higher ranks stay quite close to
each other at around a headshot percent of 25. Naturally, as we climb
the ranks, this percentage will increase, but actually it is striking to
me to see that Immortal 3 players have a higher average than Radiant
players (the top rank of the game – top 500 in the region). Although
this is the case, it can also be seen that Radiant players are more
consistent, since their Q1 and Q3 are closer to the mean than that of
the Immortal 3 players. Overall, these players are hitting about one
headshot per five shots total.
valorant %>%
ggplot(aes(x = damage_round, y = gun1_name)) +
geom_boxplot()
Let’s try to settle the debate ourselves, which gun is better… the
Vandal or the Phantom? As we can see in this box and whisker plot,
Vandal users are dealing a higher average amount of damage per round
than Phantom users. Similar to our Radiant and Immortal 3 comparison in
the last box and whisker plot, Phantom users are more consistent while
Vandal users cover a large range of damage per round. So I’ll ask you
then… would you rather have a more consistent player or one that can
either pop-off or perform underwhelmingly?
We have finally reached model building! As mentioned previously, I will be making a linear regression model, a decision tree, a random forest, and a boosted tree. First, we start off by splitting the data. I chose to do an 80/20 split between the training set and the testing set, and then also stratified it to the variable we are predicting: win_percent.
After splitting we can recognize that the training data correctly contains xxx observations (80% of observations in our sample size) and the testing data correctly contains xxx observations (20% of observations in our sample size).
valSplit <- initial_split(valorant, prop = 0.8, strata = win_percent)
valTrain <- training(valSplit)
valTest <- testing(valSplit)
Next I implement the k-fold cross-validation method by using 5 folds and stratifying it to our predicted variable.
valFold <- vfold_cv(valTrain, strata = win_percent, v = 5)
Following up the folds, I create the recipe that we are going to be adding to the future models.
valRecipe <- recipe(win_percent ~ rating + damage_round + headshots + headshot_percent +
aces + clutches + flawless + first_bloods +
kills + deaths + assists + kd_ratio +
kills_round + most_kills + score_round +
agent_1 + agent_2 + agent_3 +
gun1_name + gun1_head + gun1_body + gun1_legs + gun1_kills +
gun2_name + gun2_head + gun2_body + gun2_legs + gun2_kills +
gun3_name + gun3_head + gun3_body + gun3_legs + gun3_kills, data = valTrain) %>%
# predict win_percent with all variables
step_dummy(all_nominal_predictors()) %>% # dummy code nominal variables
step_other(agent_1, agent_2, agent_3, gun1_name, gun2_name, gun3_name) %>% # pool infrequently occurring values into an "other" category
step_novel(agent_1, agent_2, agent_3, gun1_name, gun2_name, gun3_name) %>% # assign a previously unseen factor level to a new value
step_center(all_numeric_predictors()) %>% # center all numeric predictors
step_scale(all_numeric_predictors()) %>% # scale all numeric predictors
step_pca(headshots, most_kills, kd_ratio,
gun1_head, gun1_body, gun1_legs,
gun2_head, gun2_body, gun2_legs,
gun3_head, gun3_body, gun3_legs,
gun1_kills, gun2_kills, gun3_kills,
kills, assists, deaths,
headshot_percent,
flawless, clutches, aces, first_bloods,
score_round, damage_round, kills_round,
num_comp = 3) %>% # convert numeric data into one or more principal components
step_nzv(all_predictors()) # remove variables that are highly sparse and unbalanced
valRecipe2 <- recipe(win_percent ~ rating + damage_round + headshots + headshot_percent +
aces + clutches + flawless + first_bloods +
kills + deaths + assists + kd_ratio +
kills_round + most_kills + score_round +
agent_1 + agent_2 + agent_3 +
gun1_name + gun1_head + gun1_body + gun1_legs + gun1_kills +
gun2_name + gun2_head + gun2_body + gun2_legs + gun2_kills +
gun3_name + gun3_head + gun3_body + gun3_legs + gun3_kills, data = valTrain) %>%
# predict win_percent with all variables
step_dummy(all_nominal_predictors()) %>% # dummy code nominal variables
step_other(agent_1, agent_2, agent_3, gun1_name, gun2_name, gun3_name) %>% # pool infrequently occurring values into an "other" category
step_novel(agent_1, agent_2, agent_3, gun1_name, gun2_name, gun3_name) %>% # assign a previously unseen factor level to a new value
step_center(all_predictors()) %>% # center all numeric predictors
step_scale(all_predictors()) %>% # scale all numeric predictors
step_pca(headshots, most_kills, kd_ratio,
gun1_head, gun1_body, gun1_legs,
gun2_head, gun2_body, gun2_legs,
gun3_head, gun3_body, gun3_legs,
gun1_kills, gun2_kills, gun3_kills,
kills, assists, deaths,
headshot_percent,
flawless, clutches, aces, first_bloods,
score_round, damage_round, kills_round,
num_comp = 3) %>% # convert numeric data into one or more principal components
step_nzv(all_predictors()) # remove variables that are highly sparse and unbalanced
Let’s get started with the first model…
lm_spec <- linear_reg() %>%
set_mode('regression') %>%
set_engine('lm')
# set the mode to regression and the engine to linear model
lm_fit <- fit(lm_spec, win_percent ~ rating + damage_round + headshots + headshot_percent +
aces + clutches + flawless + first_bloods +
kills + deaths + assists + kd_ratio +
kills_round + most_kills + score_round +
agent_1 + agent_2 + agent_3 +
gun1_name + gun1_head + gun1_body + gun1_legs + gun1_kills +
gun2_name + gun2_head + gun2_body + gun2_legs + gun2_kills +
gun3_name + gun3_head + gun3_body + gun3_legs + gun3_kills,
data = valTrain)
# nothing to complicated about linear regression, fit the specified model, add a formula, and fit to the training data
lm_results <- augment(lm_fit, new_data = valTrain) %>%
rmse(truth = win_percent, estimate = .pred)
lm_results
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 7.56
# How did it do?
The linear regression model acquired an RMSE of 7.26781, not bad but could be better.
Now for the decision tree!
tree_spec <- decision_tree() %>%
set_engine("rpart")
# Using the rpart engine to help make the decision tree
reg_tree_spec <- tree_spec %>%
set_mode("regression")
# Specify the tree as regression
reg_tree_fit <- fit(reg_tree_spec,
win_percent ~ rating + damage_round + headshots + headshot_percent +
aces + clutches + flawless + first_bloods +
kills + deaths + assists + kd_ratio +
kills_round + most_kills + score_round +
agent_1 + agent_2 + agent_3 +
gun1_name + gun1_head + gun1_body + gun1_legs + gun1_kills +
gun2_name + gun2_head + gun2_body + gun2_legs + gun2_kills +
gun3_name + gun3_head + gun3_body + gun3_legs + gun3_kills,
data = valTrain)
# Fit the model to the training data set
reg_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
# Plot the tree!
reg_tree_wf <- workflow() %>%
add_model(reg_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_formula(win_percent ~ rating + damage_round + headshots + headshot_percent +
aces + clutches + flawless + first_bloods +
kills + deaths + assists + kd_ratio +
kills_round + most_kills + score_round +
agent_1 + agent_2 + agent_3 +
gun1_name + gun1_head + gun1_body + gun1_legs + gun1_kills +
gun2_name + gun2_head + gun2_body + gun2_legs + gun2_kills +
gun3_name + gun3_head + gun3_body + gun3_legs + gun3_kills)
# Create a workflow for the decision tree
param_grid <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 10)
tune_res <- tune_grid(reg_tree_wf, resamples = valFold, grid = param_grid, metrics = metric_set(rmse, rsq)) # Trying to find the best value for the parameters
## ! Fold1: internal: A correlation computation is required, but `estimate` is constant and ha...
## ! Fold2: internal: A correlation computation is required, but `estimate` is constant and ha...
## ! Fold3: internal: A correlation computation is required, but `estimate` is constant and ha...
## ! Fold4: internal: A correlation computation is required, but `estimate` is constant and ha...
## ! Fold5: internal: A correlation computation is required, but `estimate` is constant and ha...
autoplot(tune_res)
It can be seen that as we increase the cost-complexity parameter, the
RSQ will eventually peak, but the RMSE is at its lowest point.
best_complexity <- select_best(tune_res, metric = "rmse") # Selecting the best value for RMSE
reg_tree_final <- finalize_workflow(reg_tree_wf, best_complexity) # Creating a workflow using the best parameter value
reg_tree_final_fit <- fit(reg_tree_final, data = valTrain) # Fitting the model onto the training data set
reg_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call rpart.plot with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
# Plot the tree again!
reg_tree_result <-augment(reg_tree_final_fit, new_data = valTest) %>%
rmse(truth = win_percent, estimate = .pred) # Let's check how well our model did
reg_tree_result
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 8.11
Our model got an RMSE of 9.295308 – this model does not seem that great.
Moving onto Random Forest!
rf_spec <- rand_forest(mtry = .cols()) %>%
set_engine("randomForest", importance = TRUE) %>%
set_mode("regression")
# Creating a Random Forest model
rf_fit <- fit(rf_spec, win_percent ~ rating + damage_round + headshots + headshot_percent +
aces + clutches + flawless + first_bloods +
kills + deaths + assists + kd_ratio +
kills_round + most_kills + score_round +
agent_1 + agent_2 + agent_3 +
gun1_name + gun1_head + gun1_body + gun1_legs + gun1_kills +
gun2_name + gun2_head + gun2_body + gun2_legs + gun2_kills +
gun3_name + gun3_head + gun3_body + gun3_legs + gun3_kills,
data = valTrain)
# Fit the model to the training data set
rf_result <- augment(rf_fit, new_data = valTest) %>%
rmse(truth = win_percent, estimate = .pred)
rf_result
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 7.65
augment(rf_fit, new_data = valTest) %>%
ggplot(aes(win_percent, .pred)) +
geom_abline() +
geom_point(alpha = 0.5)
# Let's see how well the model is able to predict the win_percent
By taking a look at the graph, it’s obvious that this model did not do a very good job at predicting the true value of win_percent.
vip(rf_fit) # Which predictors were most important at helping predict the true value?
In general, we can see that predictors such as “kd_ratio” and “rating”
were quite important in helping our model predict the win percentage of
a player. On the other hand, variables such as “kills” or “damage_round”
don’t necessarily mean a player is going to have a high win
percentage.
Last but not least… Boosted Tree!
boost_spec <- boost_tree(trees = 5000, tree_depth = 4) %>%
set_engine("xgboost") %>%
set_mode("regression")
boost_fit <- fit(boost_spec, win_percent ~ ., data = valTrain)
boost_result <-augment(boost_fit, new_data = valTest) %>%
rmse(truth = win_percent, estimate = .pred)
boost_result
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.76
Let’s put the results together to get an easier look of which model did the best shall we?
bind_rows(lm_results, reg_tree_result, rf_result, boost_result)
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 7.56
## 2 rmse standard 8.11
## 3 rmse standard 7.65
## 4 rmse standard 2.76
In conclusion, after our detailed analysis of Valorant, the model that performed the best at predicting the win percentage of a player given their individual statistics is Boosted Tree!
There are a lot of ways in which I could have improved upon in this project, such as building more models (Support Vector Machines, ridge regression, neural networks, etc.) or simply manipulating the variables a little bit more. As I worked through this project, a couple of my models failed and I ultimately was not able to pull results from it. Even more so, the models that were able run performed very poorly. As of right now, it is hard to determine what is the cause for the poor performance, but I think I can confidently say that this project did not have the results I wished for. Taking this project as a learning lesson, I hope to work on this project in the future and I hope to eventually get it right.
Overall, this project felt very rewarding to work on every bit of the way. It was a good chance to work on my skills and build some experience with machine learning and projects in general – and I got to work on a topic I am heavily interested in! Thank you for following along and I hope you enjoyed going through this project with me!
In the wise words of my random teammate, “gg go next.”